ReBNN: Resilient Binary Neural Network


in which $\gamma^n_i$ is a balanced parameter. Based on the objective, the weight gradient in Eq. (3.141) becomes:
$$
\delta_{w^n_i} = \frac{\partial L}{\partial w^n_i} + \gamma^n_i \left(w^n_i - \alpha^n_i b_{w^n_i}\right)
= \alpha^n_i \left(\frac{\partial L}{\partial \hat{w}^n_i}\,\mathbf{1}_{|w^n_i|\le 1} - \gamma^n_i b_{w^n_i}\right) + \gamma^n_i w^n_i.
\tag{3.144}
$$

The term $S^n_i(\alpha^n_i, w^n_i) = \gamma^n_i \left(w^n_i - \alpha^n_i b_{w^n_i}\right)$ is an additional term added in the backpropagation process. We add this term because an overly small $\alpha^n_i$ diminishes the gradient $\delta_{w^n_i}$ and causes a constant weight $w^n_i$. In what follows, we state and prove the proposition that $\delta_{w^n_{i,j}}$ is a resilient gradient for a single weight $w^n_{i,j}$. Where convenient, we omit the subscript $i,j$ and the superscript $n$ for ease of presentation.
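The resilient gradient of Eq. (3.144) is straightforward to compute per layer. The following is a minimal NumPy sketch (our illustration, not the authors' implementation; the shapes, values, and the name `resilient_grad` are hypothetical) that evaluates the gradient and verifies that the two forms in Eq. (3.144) agree:

```python
import numpy as np

def resilient_grad(w, alpha, grad_w_hat, gamma):
    """Resilient gradient of Eq. (3.144) for one layer:
    delta_w = alpha * (dL/dw_hat * 1_{|w|<=1} - gamma * b_w) + gamma * w.
    """
    b_w = np.sign(w)                          # binary weight b_w
    ste = (np.abs(w) <= 1).astype(w.dtype)    # straight-through estimator 1_{|w|<=1}
    return alpha * (grad_w_hat * ste - gamma * b_w) + gamma * w

# Illustrative values (hypothetical, not from the chapter).
w = np.array([0.3, -0.8, 1.5])
g = np.array([0.1, -0.2, 0.05])
alpha, gamma = 0.5, 0.01

delta = resilient_grad(w, alpha, g, gamma)

# The first form, dL/dw + gamma * (w - alpha * b_w), gives the same result,
# since dL/dw = alpha * (dL/dw_hat) * 1_{|w|<=1} under the STE.
ste = (np.abs(w) <= 1).astype(w.dtype)
delta_alt = alpha * g * ste + gamma * (w - alpha * np.sign(w))
assert np.allclose(delta, delta_alt)
```

Note how the third weight ($|w| > 1$) receives no loss gradient through the STE, yet the $\gamma$-terms still move it, which is exactly the role of $S(\alpha, w)$.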

Proposition 1. The additional term $S(\alpha, w) = \gamma\,(w - \alpha b_w)$ achieves a resilient training process by suppressing frequent weight oscillation. Its balanced factor $\gamma$ can be considered the parameter that controls the occurrence of weight oscillation.

Proof: For a single weight $w$ centered around zero, the straight-through estimator gives $\mathbf{1}_{|w|\le 1} = 1$; thus, we omit it in the following. Based on Eq. (3.144), with a learning rate $\eta$, the weight update is formulated as:

$$
\begin{aligned}
w^{t+1} &= w^t - \eta\,\delta_{w^t} \\
&= w^t - \eta\left[\alpha^t\left(\frac{\partial L}{\partial \hat{w}^t} - \gamma b_{w^t}\right) + \gamma w^t\right] \\
&= (1-\eta\gamma)\,w^t - \eta\alpha^t\left(\frac{\partial L}{\partial \hat{w}^t} - \gamma b_{w^t}\right) \\
&= (1-\eta\gamma)\left[w^t - \frac{\eta\alpha^t}{1-\eta\gamma}\left(\frac{\partial L}{\partial \hat{w}^t} - \gamma b_{w^t}\right)\right],
\end{aligned}
\tag{3.145}
$$

where $t$ denotes the $t$-th training iteration. Different weights lie at different distances from the quantization levels $\pm 1$; therefore, their gradients should be modified according to their scaling factors and the current learning rate. We first assume the initial state $b_{w^t} = -1$; the analysis applies analogously to the case $b_{w^t} = +1$. The oscillation probability from iteration $t$ to $t+1$ is the following:

$$
P\left(b_{w^t} \ne b_{w^{t+1}}\right)\Big|_{b_{w^t}=-1} \propto P\left(\frac{\partial L}{\partial \hat{w}^t} \le -\gamma\right).
\tag{3.146}
$$
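To make the step to Eq. (3.146) explicit, substitute $b_{w^t} = -1$ into Eq. (3.145):
$$
w^{t+1} = (1-\eta\gamma)\,w^t - \eta\alpha^t\left(\frac{\partial L}{\partial \hat{w}^t} + \gamma\right).
$$
Since $w^t < 0$ and $1 - \eta\gamma > 0$ for typical learning rates, the first term is negative, so a sign flip $w^{t+1} > 0$ requires $\frac{\partial L}{\partial \hat{w}^t} + \gamma < \frac{(1-\eta\gamma)\,w^t}{\eta\alpha^t} < 0$; hence a flip essentially demands $\frac{\partial L}{\partial \hat{w}^t} \le -\gamma$.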

Similarly, the oscillation probability from iteration t + 1 to t + 2 is as follows:

$$
P\left(b_{w^{t+1}} \ne b_{w^{t+2}}\right)\Big|_{b_{w^{t+1}}=+1} \propto P\left(\frac{\partial L}{\partial \hat{w}^{t+1}} \ge \gamma\right).
\tag{3.147}
$$

Thus, the sequential oscillation probability from iteration t to t + 2 is as follows:

$$
P\left(\left(b_{w^t} \ne b_{w^{t+1}}\right) \cap \left(b_{w^{t+1}} \ne b_{w^{t+2}}\right)\right)\Big|_{b_{w^t}=-1} \propto P\left(\left(\frac{\partial L}{\partial \hat{w}^t} \le -\gamma\right) \cap \left(\frac{\partial L}{\partial \hat{w}^{t+1}} \ge \gamma\right)\right),
\tag{3.148}
$$

which shows that a sequential weight oscillation occurs only if the magnitudes of $\frac{\partial L}{\partial \hat{w}^t}$ and $\frac{\partial L}{\partial \hat{w}^{t+1}}$ are both larger than $\gamma$. As a result, the attached factor $\gamma$ can be considered a parameter that controls the occurrence of weight oscillation.
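The proposition can also be sanity-checked numerically. The sketch below (our illustration, not the chapter's code) iterates the single-weight update of Eq. (3.145) under Gaussian gradient noise and counts sign flips of $b_w$; the noise level and the settings of $\eta$, $\alpha$, and $\sigma$ are hypothetical. Raising $\gamma$ should suppress the flips, since a flip then requires a gradient of magnitude larger than $\gamma$:

```python
import numpy as np

def count_flips(gamma, eta=0.1, alpha=1.0, sigma=0.5, steps=200, seed=0):
    """Iterate Eq. (3.145) for one weight and count sign flips of b_w.

    Gradients dL/dw_hat are drawn i.i.d. from N(0, sigma**2) purely for
    illustration.
    """
    rng = np.random.default_rng(seed)
    w, flips = -0.05, 0
    for _ in range(steps):
        g = rng.normal(0.0, sigma)     # noisy gradient w.r.t. w_hat
        b = np.sign(w)                 # current binary state b_w
        # Eq. (3.145): w <- (1 - eta*gamma) * w - eta * alpha * (g - gamma * b)
        w = (1 - eta * gamma) * w - eta * alpha * (g - gamma * b)
        if np.sign(w) != b:            # b_w oscillated at this step
            flips += 1
    return flips

# Aggregated over several seeds, gamma = 2 yields far fewer flips than gamma = 0.
no_reg = sum(count_flips(0.0, seed=s) for s in range(10))
with_reg = sum(count_flips(2.0, seed=s) for s in range(10))
```

With $\gamma = 2$ a flip requires $|\partial L/\partial \hat{w}|$ around $4\sigma$, which is vanishingly rare at this noise level, whereas with $\gamma = 0$ the weight random-walks across zero and $b_w$ oscillates freely.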